Last Quarter’s Review

Published

January 9, 2025

Before we start

  • You are expected to have R and RStudio installed. If you have not done so yet, see the Installing R and RStudio section.

  • In the discussion section, we will focus on coding and practicing what we have learned in the lectures.

  • Office hours are on Tuesday, 11-12:30, in Scott 110.

  • Questions?

Brief recap of the last quarter

Coding Terminology

Code Chunk

To insert a Code Chunk, use Ctrl+Alt+I on Windows or Cmd+Option+I on Mac. Run the whole chunk by clicking the green triangle, or run one or more selected lines with Ctrl+Enter on Windows or Cmd+Return on Mac.

print("Code Chunk")
[1] "Code Chunk"

Function and Arguments

Most of the functions we want to run require an argument. For example, the function print() above takes the argument “Code Chunk”.

function(argument)
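As a small illustration of the pattern above, the sketch below calls base R's round() twice: once with a single argument, and once with an additional named argument (digits). The number is an arbitrary example, not something from the lecture.

```r
# round() takes the number to round and, optionally, how many digits to keep
rounded_default = round(3.14159)           # digits defaults to 0
rounded_two = round(3.14159, digits = 2)   # named argument

rounded_default
rounded_two
```

Naming an argument (digits = 2) makes the call easier to read and means the order of the arguments no longer matters.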

Data structures

There are many data structures, but the most important ones to know are the following.

  • Objects. Those are individual units, e.g. a number or a word.
number = 1
number

word = "Northwestern"
word
[1] 1
[1] "Northwestern"
  • Vectors. Vectors are collections of objects. To create one, use the c() function.
numbers = c(1, 2, 3)
numbers
[1] 1 2 3
  • Dataframes. Dataframes are the most used data structure, and last quarter you spent a lot of time working with them. A dataframe is a table with data. Columns are called variables, and each column is a vector. You can access a column using the $ operator.
df = data.frame(numbers, 
                numbers_multiplied = numbers * 2)
df
df$numbers_multiplied
  numbers numbers_multiplied
1       1                  2
2       2                  4
3       3                  6
[1] 2 4 6
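Besides the $ operator, square brackets can pick out individual elements or rows. The sketch below rebuilds the same numbers vector and df dataframe so that it runs on its own. The names second_number, big_rows, and n_rows are illustrative and do not appear in the lecture code.

```r
numbers = c(1, 2, 3)
df = data.frame(numbers,
                numbers_multiplied = numbers * 2)

second_number = numbers[2]        # second element of the vector
big_rows = df[df$numbers > 1, ]   # rows where numbers is greater than 1
n_rows = nrow(df)                 # number of rows in the dataframe

second_number
big_rows
n_rows
```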

Data classes

We work with various classes of data, and the analysis we perform depends heavily on these classes.

  • Numeric. Continuous data.
numeric_class = c(1.2, 2.5, 7.3)
numeric_class
class(numeric_class)
[1] 1.2 2.5 7.3
[1] "numeric"
  • Integer. Whole numbers (e.g., count data).
integer_class = c(1:3)
class(integer_class)
[1] "integer"
  • Character. Usually represents textual data.
word
[1] "Northwestern"
class(word)
[1] "character"
  • Factor. Categorical variables, where each value is treated as an identifier for a category.
colors = c("blue", "green")
class(colors)
[1] "character"

As you can see, R stored these values as character rather than factor. We can convert them using the as.factor() function. Similar functions convert a variable to other classes: as.numeric(), as.integer(), as.character().

colors = as.factor(colors)
class(colors)
[1] "factor"
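To see what the conversion buys us, the sketch below inspects a factor with levels() and table(), and converts a character vector of digits with as.numeric(). The extra "blue" value and the digits_as_text vector are made up for illustration.

```r
colors = as.factor(c("blue", "green", "blue"))
color_levels = levels(colors)   # distinct categories, sorted alphabetically
color_counts = table(colors)    # number of observations per category

digits_as_text = c("1", "2", "3")
digits_as_numbers = as.numeric(digits_as_text)

color_levels
color_counts
class(digits_as_numbers)
```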

Libraries

Quite frequently, we will use additional libraries to extend the capabilities of R. I’m sure you remember tidyverse. Let’s load it.

library(tidyverse)

If you updated your R or recently downloaded it, you can easily install libraries using the function install.packages().
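One possible way to avoid reinstalling a library every time is to check for it first. The helper below, install_if_missing(), is a hypothetical convenience function written for this example, not part of R or tidyverse. It relies only on requireNamespace() and install.packages().

```r
# Hypothetical helper: install a package only if it is not available yet
install_if_missing = function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# "stats" ships with R, so this call installs nothing
install_if_missing("stats")
```

In practice you would call install_if_missing("tidyverse") once and then load the library with library(tidyverse) as usual.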

Pipes

Pipes (%>% or |>) help streamline code: they let you write a sequence of operations in the order in which they happen. In plain English, a pipe translates to “take an object, and then”.

numbers %>%
  print()
[1] 1 2 3
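The benefit of pipes becomes clearer with more than one step. The sketch below computes the same quantity twice, once with nested calls and once with the native pipe |>. It assumes R 4.1 or newer, where |> is available. With tidyverse loaded, %>% can be used in the same way.

```r
numbers = c(1, 2, 3)

# Nested call: read from the inside out
nested_result = round(sqrt(sum(numbers)), 2)

# Piped call: read from top to bottom
piped_result = numbers |>
  sum() |>
  sqrt() |>
  round(2)

identical(nested_result, piped_result)
```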

Describing Data

First task: install vdemdata from your console. Then load the library.

library(vdemdata)

This is the V-Dem dataset. For your reference, their codebook is available here.

The dataset is huge! Be careful.

nrow(vdem)
ncol(vdem)
[1] 27734
[1] 4607

Imagine you are interested in the relationship between regime type and physical violence. Let’s select the variables we will work with. Unfortunately, the variable names are not very intuitive: the regime index is e_v2x_polyarchy_5C and the physical violence index is v2x_clphy.

violence_data = vdem %>%
  select(country_name, year, e_v2x_polyarchy_5C, v2x_clphy) 

Let’s rename the variables so it’s easier to work with them.

violence_data = violence_data %>%
  rename(regime = e_v2x_polyarchy_5C,
         violence = v2x_clphy)

Now, analyze the regime data. We can describe regime data using various statistics. Let’s check the min score for the regime.

min(violence_data$regime, na.rm = T)
[1] 0

Check the max score for the regime variable below.

...(violence_data$regime, na.rm = T)

Check the average score for the regime variable below.

mean(..., na.rm = T)

Finally, use the summary() function.

summary(violence_data$regime)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.2224  0.2500  1.0000    1139 

violence_data = violence_data %>%
  mutate(dem = ifelse(regime >= 0.5, 1, 0))

Statistic            Function    Example Usage
Minimum              min()       min(x)
Maximum              max()       max(x)
Mean                 mean()      mean(x)
Median               median()    median(x)
Standard Deviation   sd()        sd(x)
Variance             var()       var(x)
Sum                  sum()       sum(x)
Summary              summary()   summary(x)
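The functions in the table can be tried on any numeric vector. The sketch below uses a made-up vector x with one missing value to show why na.rm = TRUE matters. The numbers are illustrative and are not taken from the V-Dem data.

```r
x = c(2, 4, 4, NA, 10)   # made-up values with one missing observation

mean(x)                  # returns NA because of the missing value

x_mean = mean(x, na.rm = TRUE)
x_median = median(x, na.rm = TRUE)
x_sd = sd(x, na.rm = TRUE)
x_sum = sum(x, na.rm = TRUE)

x_mean
x_median
x_sd
x_sum
```

Without na.rm = TRUE, a single missing value makes most of these functions return NA.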

Sampling

Base R vs Tidyverse

Useful functions, sample()
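A minimal sketch of how sample() can be used, assuming a made-up population of the integers 1 to 10. The seed value is arbitrary and only makes the draw reproducible.

```r
set.seed(2025)   # arbitrary seed so the draw can be reproduced

population = 1:10
draw_without = sample(population, size = 3)                 # no repeats
draw_with = sample(population, size = 20, replace = TRUE)   # repeats allowed

draw_without
draw_with
```

By default sample() draws without replacement, so size cannot exceed the length of the vector unless replace = TRUE.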

Visualizations

  • Tidyverse basics (mutate, filter, select, summarize, etc.)
  • Descriptive statistics
  • Confidence intervals

Function      Description
select()      Selects specific columns from a data frame
mutate()      Adds new variables or modifies existing ones
filter()      Filters rows based on specified conditions
group_by()    Groups data by one or more variables for subsequent operations
summarize()   Summarizes data by applying a function (e.g., mean, sum)
case_when()   Modifies a variable based on conditional logic
rename()      Renames columns in a data frame
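Several of these functions are typically chained together with pipes. The sketch below is one possible combination on a made-up dataframe called toy. Its column names and values are invented for illustration and have nothing to do with the V-Dem data. It assumes the dplyr package (part of tidyverse) is installed.

```r
library(dplyr)

# Made-up data for illustration only
toy = data.frame(country = c("A", "A", "A", "B", "B"),
                 year = c(1999, 2000, 2001, 2000, 2001),
                 score = c(0.1, 0.2, 0.6, 0.8, 0.9))

toy_summary = toy %>%
  filter(year >= 2000) %>%                        # drop the 1999 row
  mutate(level = case_when(score >= 0.5 ~ "high",
                           TRUE ~ "low")) %>%     # label each observation
  group_by(country) %>%                           # one group per country
  summarize(mean_score = mean(score),
            n_high = sum(level == "high"))

toy_summary
```

The result has one row per country, with the average score and the number of "high" observations in each group.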

You can check how to use these commands in this script, or open the help page for any function by typing ? followed by its name, e.g. ?mutate.

Helpful to review

Installing R and RStudio

First, we need to install R. Click the button below and click “Download and Install R”. Choose your OS. For Windows, you need to download “base”; for macOS and Linux, choose the version that matches your OS. Install.

Download R
Step one

For Windows:

Second, we need to install RStudio. Click the button below and click “Download RStudio Desktop”. You will be redirected to your version automatically. Install.

Download RStudio
Step two